December 11, 2015
Raw data given as excel file with ~300 tabs, one for each patient, like this:
Along with timeseries measurements, the following static covariates were given for each patient:
As well as an outcome score called "GOS" measured at 3, 6, 12, and 24 months:
For the sake of modeling, the following assumptions were made:
There were 339 patients in raw data but only 268 in filtered dataset.
Note: Of 268 patients in filtered set, 18 had missing 3 month GOS
Frequency of non-time-dependent values:
Relationships between static variables and outcomes are pretty clear:
Timeseries measurements for 4 random patients:
Some observations on timeseries values:
Things should level off by hour 48, validating the assumption that earlier measurements are more relevant.
A big challenge involves working around the fact that this model for the data below is not valid:
glm(outcome ~ age + pbto2, family='binomial')
| age | pbto2 | uid | tsi_min | outcome |
|---|---|---|---|---|
| 29 | 0 | 636 | 540 | 0 |
| 29 | 1 | 636 | 600 | 0 |
| 29 | 2 | 636 | 660 | 0 |
| 24 | 18 | 656 | 720 | 1 |
| 24 | 16 | 656 | 1080 | 1 |
Repeating the non-time-dependent variables like this leads to confidence intervals on coefficients that are unrealistically small (and not valid).
Possible Workarounds:
All of the models that follow take approach 3, though the way they deal with the thresholds will differ.
As a first attempt at modeling, all timeseries variables were separated into ranges based on known guidelines for safe values (Primarily from this Lab Value Guide).
For example, the PaCO2 variable was split into three new variables:
PaCO2 value in [0, 35)PaCO2 value in [35, 45)PaCO2 value in [45, +Inf)For all variables, the known "safe range" was then dropped from the model to avoid linear dependence (percentages across all variables would add up to 1). This leaves only variables indicating time spent at dangerous levels.
## variable mean sd min max type ## 1 icp1_20_inf 0.04 0.13 0 1.00 GasRange ## 2 paco2_0_35 0.44 0.28 0 1.00 GasRange ## 3 paco2_45_inf 0.05 0.12 0 0.80 GasRange ## 4 pao2_0_30 0.04 0.09 0 0.43 GasRange ## 5 pao2_100_inf 0.08 0.15 0 1.00 GasRange ## 6 pbto2_0_20 0.31 0.31 0 1.00 GasRange ## 7 pbto2_100_inf 0.00 0.02 0 0.19 GasRange ## 8 pha_0_7.35 0.10 0.18 0 1.00 GasRange ## 9 pha_7.45_inf 0.30 0.27 0 1.00 GasRange ## 10 age 37.75 16.37 16 77.00 Static ## 11 gcs 5.90 1.46 3 8.00 Static ## 12 marshall 3.06 1.47 1 6.00 Static ## 13 sex 0.78 0.42 0 1.00 Static
Best model selected through exhaustive AIC search:
glmulti('gos ~ .', family='binomial', level=1)
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -1.484 | 0.198 | -7.506 | 0.000 |
| age | -0.639 | 0.195 | -3.275 | 0.001 |
| gcs | 0.542 | 0.184 | 2.939 | 0.003 |
| paco2_45_inf | 0.509 | 0.175 | 2.916 | 0.004 |
| pao2_0_30 | -0.430 | 0.202 | -2.124 | 0.034 |
| icp1_20_inf | -0.621 | 0.303 | -2.049 | 0.040 |
| pbto2_0_20 | -0.347 | 0.182 | -1.909 | 0.056 |
| pbto2_100_inf | -0.777 | 0.432 | -1.797 | 0.072 |
| marshall | -0.327 | 0.197 | -1.658 | 0.097 |
| Estimate | Nb models | Importance | +/- (alpha=0.05) | |
|---|---|---|---|---|
| paco2_0_35 | -0.045 | 37.000 | 0.294 | 0.190 |
| pao2_100_inf | -0.054 | 37.000 | 0.325 | 0.204 |
| sex | 0.067 | 37.000 | 0.342 | 0.238 |
| pha_0_7.35 | -0.115 | 44.000 | 0.427 | 0.353 |
| pha_7.45_inf | -0.115 | 47.000 | 0.454 | 0.336 |
| marshall | -0.241 | 65.000 | 0.701 | 0.450 |
| pbto2_0_20 | -0.269 | 69.000 | 0.780 | 0.420 |
| pao2_0_30 | -0.423 | 94.000 | 0.969 | 0.416 |
| icp1_20_inf | -0.620 | 95.000 | 0.977 | 0.622 |
| pbto2_100_inf | -0.748 | 99.000 | 0.996 | 0.844 |
| (Intercept) | -1.486 | 100.000 | 1.000 | 0.394 |
| age | -0.690 | 100.000 | 1.000 | 0.397 |
| gcs | 0.551 | 100.000 | 1.000 | 0.367 |
| paco2_45_inf | 0.521 | 100.000 | 1.000 | 0.394 |
A potential improvement on the linear models would involve not needing to set "threshold" ranges for blood gas measurements manually. The most basic version of this kind of model would involve a single threshold and a step function where having blood gas values on one side of that function has a different effect that the other side.
A generalization of this could allow more than one threshold as well as the possibility for gradual shifts between regions of values where effects differ.
A modified logistic model:
\[ logit(y_i) = \alpha + \beta \cdot X_i + f(G_{ij}) \]
where
\[ X_i = [Gender_i, Age_i, CommaScore_i, MarshallScore_i] \]
and
\[ f(G_i) = \frac{1}{n_i} \sum_j{ \frac{c_1}{1 + e^{-c_2(G_{ij} - c_3)}} + \frac{c_4}{1 + e^{-c_5(G_{ij} - c_6)}} } \] \[ n_i = \text{ length of timeseries for patient }i \]
The "double logistic" function definition for \(f\) allows for flexiblity in thresholds and graduated changes.
MCMC sampling model used: Model on Github
These are effect functions drawn from the priors in the model to show possibilities and biases the model begins with:
Parameter recovery after hard coding coefficient / function values for the actual (rather than simulated) data: